- 
                Notifications
    You must be signed in to change notification settings 
- Fork 12
WIP: Hacky(!) but faster nbody implementations #44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
WIP: Hacky(!) but faster nbody implementations #44
Conversation
| const PAIRS = Tuple((i,j) for i = 1:5, j = 1:5 if j > i) | ||
|  | ||
| struct Bodies | ||
| x::MMatrix{NBODIES, 3, Float64} | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is cheating imo. Are other languages really encoding the number of bodies as a compile time constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very much so, e.g.:
I don't have a strong opinion about this, although it seems inevitable that benchmark implementations slowly slide towards more compile-time computation in the absence of clear, explicit guidelines. It's also quite arguable whether hand-vectorizing everything shows much, other than how easily a language lets you call intrinsics.
Julia's well-equipped to play this game though, especially through the macro facilities. Up to you whether you think it's a net benefit to the community to showcase this.
846e6b5    to
    f6ddb3f      
    Compare
  
    | Very cool. I think we can probably polish it to be more idiomatic Julia over time. I've been trying to get this one faster using simd intrinsics on my machine for a while now and mostly failing. My preference would be to just use NTuples with unsafe_store! for now. I think getting AOT compilation working consistently on the benchmarks-game machine might be a ways off. And it might not be accepted at all depending on whether the maintainer wants to deal with the headache. | 
|  | ||
| rsqrt(f::__m128) = ccall("llvm.x86.sse.rsqrt.ps", llvmcall, __m128, (__m128, ), f); | ||
| _mm_cvtpd_ps(f::__m128d) = ccall("llvm.x86.sse2.cvtpd2ps", llvmcall, __m128, (__m128d, ), f); | ||
| _mm_cvtps_pd(f::__m128) = llvmcall(("", " | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How difficult would it be to contribute these to SIMD.jl instead of including them inline here? It seems like rsqrt at least should be relatively trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@non-Jedi I think it's doable, but probably beyond my current Julia abilities.
I did find that LLVM will only emit the rsqrt assembly instructions for single and 4x floats (under -ffast-math and SSE3, I believe), see https://godbolt.org/z/v5DF_s . It does a single Newton-Raphson step with FMA.
The tricky bit here is that we would need to explicitly llvmcall the double to float conversions and word packing (for 2x case), as LLVM doesn't help us out here.
Also, I think SIMD.jl covers a subset of Julia math functions - would it be a bit strange to add a SIMD primitive that isn't in Base?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess rsqrt is only simple if you define it for only Vec{4,Float32}, which the current architecture of SIMD.jl doesn't really allow. To me the functionality still belongs in there even if it's not a Base Julia function. I'll open an issue on the repo and see what the maintainer thinks.
| unsafe_store!(dx_v2d_ptr, dx3, k_v2d + NPAIRS) | ||
|  | ||
| dsq = dx1^2 + dx2^2 + dx3^2 | ||
| drsqrt = rsqrt_pd_newton(dsq) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried a version of this that uses all 4 available Float32 and only has 3 calls to _mm_rsqrt_ps instead of 5? It may be slower since it would require caching the dsq values and drsqrt values so you could break this single loop into 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't yet, will give it a try.
The double-single conversions and shuffling may end up being a bottleneck, at least for the competition's core2 target since they are only 2-wide; in particular cvtps_pd only expands a contiguous pair of floats into a contiguous pairs of doubles.
If we allow AVX, a more optimal implementation may end up using Vec4 coordinates + 4-wide rsqrt, e.g. switching between broadcasting across coordinates vs. across planets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A 4-wide _mm_rsqrt_ps is slightly slower due to the extra caching involved (~4.3s vs. 4.15s on my machine).
Would you like me to clean this up this PR?
Here are a couple of messy implementations that, at least on my laptop with
--target-cpu=core2(the architecture of the actual BenchmarksGame test machine), beat the current ...simd.jlby about 60% and 40%:Would like some feedback before cleaning this up further (and getting too deep in this rabbit hole 🙂), in particular whether this is helpful for showing off the language, since the code is getting far from idiomatic Julia.
A few caveats:
StaticArrays), so this awaits Julia AOT (protype to compile julia scripts to exe #35) to show real gains; or I can switch usingNTuples with unsafe stores.Btw, the ...
unroll.jlfile has a hacky macro that fully unrolls some of the inner loops. This mimics how Rust #7 achieves its speedup:rustcis smart enough to automatically unroll the (outer) for loops insideadvance, e.g.rsqrtis seen 5 times in decompiled asm.I didn't go all the way to unrolling the stride-2 loop, but could be persuaded to hack something up just to see how much improvement can be found.
@KristofferC Thanks again for your help getting intrinsics working.